16  Exploratory Analysis I

Learning Objectives

After completing this tutorial you should be able to

Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the .Rproj file for this module either by double-clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the R Project icon in the top right of your program window and selecting Open Project.

There should be a file named 16_exploratory-analysis-i.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) Quarto file as your homework assignment. So, first thing, change the author in the YAML header to your name. You will use this Quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set1, but you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss the discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.

  • 1 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments describing how each function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

    Let’s make sure to read in the libraries we need for this analysis.

    # install janitor
    install.packages("janitor")
    
    # install skimr
    install.packages("skimr")

    Now we can load all of our libraries and get started.

    # load libraries
    
    # reporting
    library(knitr)
    
    # visualization
    library(ggplot2)
    library(ggthemes)
    library(patchwork)
    
    # data wrangling
    library(dplyr)
    library(tidyr)
    library(readr)
    library(skimr)
    library(janitor)
    library(magrittr)
    
    # turn off scientific notation
    options(scipen=999)

    16.1 Exploratory Analysis

    An exploratory analysis is a systematic way of exploring a data set, primarily through visualizations. Generally, it is an iterative process that starts with a question or hypothesis; most of your exploratory analysis then consists of visualizing, transforming, and even modeling your data to answer that question. Finally, you need to summarize your results and conclusions and use those answers to refine your original question or generate new questions. Frequently, an exploratory analysis is the first step before deciding what your formal analysis will look like, and it can serve to generate hypotheses for further exploration.

    Part of exploratory analysis is quality control of your data: determining whether it meets your expectations and cleaning it up as needed. Frequently it includes transforming the existing raw data into parameters that are more useful for your assessment.

    16.2 Step 1: Formulate a question

    Recall from our visualization of climate change indicators that we left the question of whether or not hurricanes are going “to get worse” unresolved, even after comparing changes in the number of named storms, major hurricanes, and their intensity.

    Let’s say we wanted to explore whether or not natural disasters have gotten worse as climate change has progressed. To do this we can turn to the International Disaster Database.

    Consider this

    Go to their website and use information you can find about the data collected to answer the following questions (use bullet points where appropriate):

    1. How is an (emergency) disaster defined?
    2. What are the major classifications of natural disasters included in the data set?
    3. Which of these categories do you think are most likely impacted by climate change?
    4. What measures can you use to determine if natural disasters are “getting worse”? Should different categories have different measures?
    5. What other mechanisms do you think could result in an increase in natural disasters?
    Did it!

    [Your answer here]

    Consider this

    “Are natural disasters getting worse?” is a pretty general question. Based on your overview of what is in the data set, try to formulate at least three more specific questions that you could explore using this data set.

    Did it!

    16.3 Read & Wrangle data set

    You should have a copy of the tab-delimited file in your data folder. It contains an additional 6 lines at the top that have information on when and where it was downloaded.

    disaster <- read_delim("data/nat-disasters_emdat-query-2021-10-05.txt", delim = "\t", skip = 6)

    Our first step is always to take a look at the data set and make sure that it has read in correctly (i.e. that the columns are correctly separated, numeric columns are numeric, character columns are characters, etc.) and is in a usable format.

    Consider this

    Take a quick look at the data set using View() to make sure everything has read in correctly. Quickly skim the dataframe to get an idea of how many columns/rows there are, how much missing data there is, that data types are correct etc.

    Did it!

    [Your answer here]

    Even if the data has read in correctly, that does not necessarily mean it is in the most usable format … remember, plotting and filtering data sets requires typing variable names over and over again. Therefore, we want to make sure that everything is set up in a way that lets us avoid mistakes that are mostly due to typos.

    For example, let’s take a look at the column names:

    colnames(disaster)
     [1] "Dis No"                          "Year"                           
     [3] "Seq"                             "Glide"                          
     [5] "Disaster Group"                  "Disaster Subgroup"              
     [7] "Disaster Type"                   "Disaster Subtype"               
     [9] "Disaster Subsubtype"             "Event Name"                     
    [11] "Country"                         "ISO"                            
    [13] "Region"                          "Continent"                      
    [15] "Location"                        "Origin"                         
    [17] "Associated Dis"                  "Associated Dis2"                
    [19] "OFDA Response"                   "Appeal"                         
    [21] "Declaration"                     "Aid Contribution"               
    [23] "Dis Mag Value"                   "Dis Mag Scale"                  
    [25] "Latitude"                        "Longitude"                      
    [27] "Local Time"                      "River Basin"                    
    [29] "Start Year"                      "Start Month"                    
    [31] "Start Day"                       "End Year"                       
    [33] "End Month"                       "End Day"                        
    [35] "Total Deaths"                    "No Injured"                     
    [37] "No Affected"                     "No Homeless"                    
    [39] "Total Affected"                  "Reconstruction Costs ('000 US$)"
    [41] "Insured Damages ('000 US$)"      "Total Damages ('000 US$)"       
    [43] "CPI"                             "Adm Level"                      
    [45] "Admin1 Code"                     "Admin2 Code"                    
    [47] "Geo Locations"                  

    Apparently, they have not read our guidelines for naming things. Column headers with special characters and spaces are super annoying; you might recall from our previous adventures in data wrangling that every time we want to call a column whose name contains spaces, we have to use back ticks to tell R that it isn’t separate words but a single name.
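To see the problem in action, here is a minimal sketch on a toy data frame (hypothetical values, not the disaster data) with a space in a column name:

```r
library(dplyr)

# toy data frame with a space in a column name (hypothetical example);
# check.names = FALSE keeps the space instead of converting it to a dot
toy <- data.frame(`Disaster Type` = c("Flood", "Storm", "Flood"),
                  check.names = FALSE)

# back ticks tell R that the two words are one column name
toy %>% filter(`Disaster Type` == "Flood")

# without back ticks this is a syntax error:
# toy %>% filter(Disaster Type == "Flood")
```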

    Now is probably a good time to learn how to rename columns. The most straightforward way is using rename().

    Let’s say we wanted to rename Dis No to DisNo; the syntax is simply NewName = OldName.

    disaster %>%
      rename(DisNo = `Dis No`) %>%
      head()
    # A tibble: 6 × 47
      DisNo    Year Seq   Glide `Disaster Group` `Disaster Subgroup` `Disaster Type`
      <chr>   <dbl> <chr> <chr> <chr>            <chr>               <chr>          
    1 1900-9…  1900 9002  <NA>  Natural          Climatological      Drought        
    2 1900-9…  1900 9001  <NA>  Natural          Climatological      Drought        
    3 1902-0…  1902 0012  <NA>  Natural          Geophysical         Earthquake     
    4 1902-0…  1902 0003  <NA>  Natural          Geophysical         Volcanic activ…
    5 1902-0…  1902 0010  <NA>  Natural          Geophysical         Volcanic activ…
    6 1903-0…  1903 0006  <NA>  Natural          Geophysical         Mass movement …
    # ℹ 40 more variables: `Disaster Subtype` <chr>, `Disaster Subsubtype` <chr>,
    #   `Event Name` <chr>, Country <chr>, ISO <chr>, Region <chr>,
    #   Continent <chr>, Location <chr>, Origin <chr>, `Associated Dis` <chr>,
    #   `Associated Dis2` <chr>, `OFDA Response` <chr>, Appeal <chr>,
    #   Declaration <chr>, `Aid Contribution` <dbl>, `Dis Mag Value` <dbl>,
    #   `Dis Mag Scale` <chr>, Latitude <chr>, Longitude <chr>, `Local Time` <chr>,
    #   `River Basin` <chr>, `Start Year` <dbl>, `Start Month` <dbl>, …

    This would be pretty annoying if we wanted to rename all of our columns by hand.

    Fortunately, we can leverage the janitor::clean_names() function2, which is designed to parse letter cases and separators. The default is to convert names to snake_case, i.e. all names are lowercase and words are separated by underscores; it also deals with duplicate names and special characters.

  • 2 The notation format you see here specifies the R package and then the function within it, as package::function().

    Let’s see what this does for us:

    disaster <- read_delim("data/nat-disasters_emdat-query-2021-10-05.txt", delim = "\t", skip = 6) %>%
      clean_names()
    
    colnames(disaster)
     [1] "dis_no"                      "year"                       
     [3] "seq"                         "glide"                      
     [5] "disaster_group"              "disaster_subgroup"          
     [7] "disaster_type"               "disaster_subtype"           
     [9] "disaster_subsubtype"         "event_name"                 
    [11] "country"                     "iso"                        
    [13] "region"                      "continent"                  
    [15] "location"                    "origin"                     
    [17] "associated_dis"              "associated_dis2"            
    [19] "ofda_response"               "appeal"                     
    [21] "declaration"                 "aid_contribution"           
    [23] "dis_mag_value"               "dis_mag_scale"              
    [25] "latitude"                    "longitude"                  
    [27] "local_time"                  "river_basin"                
    [29] "start_year"                  "start_month"                
    [31] "start_day"                   "end_year"                   
    [33] "end_month"                   "end_day"                    
    [35] "total_deaths"                "no_injured"                 
    [37] "no_affected"                 "no_homeless"                
    [39] "total_affected"              "reconstruction_costs_000_us"
    [41] "insured_damages_000_us"      "total_damages_000_us"       
    [43] "cpi"                         "adm_level"                  
    [45] "admin1_code"                 "admin2_code"                
    [47] "geo_locations"              

    That looks much better.

    16.4 Step 2: Get an overview of the data set

    Our next step is always to determine what information is contained in the data set and how it is organized.

    First, we will need to understand what information is contained in each column. Some of the column headers are more self-explanatory than others. One of the most helpful documents at this point is the metadata which will tell us what information is in each column. A good place to look for that is the data portal where you accessed the data set itself.

    Consider this

    Determine where you can find meta-data for our data set and describe what information is contained in the data set based on the included columns. Indicate which you think will be most useful to answer our general question based on your answers above.

    Did it!

    [Your answer here]

    Now, we can take a look at some of the specifics of how the information is organized in the data set and specifics on the information contained.

    Consider this

    Based on some of our previous exploits exploring data, come up with a checklist of at least five things to routinely check any time you are exploring a new data set, wherever possible include what function(s) you could use to look at that.

    Did it!

    Factors to consider include, among other things: how many rows/columns there are, what variables are included, whether the data is in a tidy format, what data type each variable is, and what the range/distribution of values is.

    You can get the dimensions of the data set using ncol() and nrow(); str() will give you the dimensions, the class of the object, and the data types (classes) of the individual columns.
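On a small toy data frame (made-up values, not the disaster data), these functions look like this:

```r
# a small toy data frame to demonstrate the overview functions
toy <- data.frame(year = c(2000, 2001, 2002),
                  type = c("Flood", "Storm", "Flood"))

ncol(toy)  # number of columns
nrow(toy)  # number of rows
dim(toy)   # both at once: rows, then columns
str(toy)   # dimensions plus the data type (class) of each column
```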

    In addition to all of the functions you just listed, another helpful function is skimr::skim().

    Give it a whirl

    Run the function on your data set and briefly describe what information you can get3.

    skim(disaster)
    Name disaster
    Number of rows 16121
    Number of columns 47
    _______________________
    Column type frequency:
    character 29
    logical 1
    numeric 17
    ________________________
    Group variables None
    Data summary

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    dis_no 0 1.00 13 13 0 16121 0
    seq 0 1.00 4 4 0 1274 0
    glide 14540 0.10 4 25 0 1086 0
    disaster_group 0 1.00 7 7 0 1 0
    disaster_subgroup 0 1.00 10 17 0 6 0
    disaster_type 0 1.00 3 21 0 15 0
    disaster_subtype 3109 0.81 6 32 0 27 0
    disaster_subsubtype 15044 0.07 4 23 0 12 0
    event_name 12263 0.24 2 76 0 1561 0
    country 0 1.00 4 58 0 228 0
    iso 0 1.00 3 3 0 228 0
    region 0 1.00 9 25 0 23 0
    continent 0 1.00 4 8 0 5 0
    location 1792 0.89 3 2878 0 12755 0
    origin 12329 0.24 4 118 0 660 0
    associated_dis 12774 0.21 3 29 0 30 0
    associated_dis2 15415 0.04 3 29 0 30 0
    ofda_response 14427 0.11 3 3 0 1 0
    appeal 13552 0.16 2 3 0 2 0
    declaration 12866 0.20 2 3 0 2 0
    dis_mag_scale 1190 0.93 3 10 0 5 0
    latitude 13393 0.17 1 10 0 2372 0
    longitude 13390 0.17 1 12 0 2438 0
    local_time 15019 0.07 4 8 0 778 0
    river_basin 14835 0.08 1 402 0 1213 0
    adm_level 8262 0.49 1 3 0 3 0
    admin1_code 11540 0.28 3 399 0 3434 0
    admin2_code 12152 0.25 3 1571 0 3446 0
    geo_locations 8262 0.49 10 2525 0 6355 0

    Variable type: logical

    skim_variable n_missing complete_rate mean count
    reconstruction_costs_000_us 16121 0 NaN :

    Variable type: numeric

    skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
    year 0 1.00 1996.76 20.16 1900.00 1989.00 2001.00 2011.00 2021 ▁▁▁▃▇
    aid_contribution 15444 0.04 125413.61 2997874.55 1.00 175.00 721.00 3511.00 78000000 ▇▁▁▁▁
    dis_mag_value 11177 0.31 47369.51 309485.37 -57.00 7.00 152.00 11315.00 13025874 ▇▁▁▁▁
    start_year 0 1.00 1996.77 20.15 1900.00 1989.00 2001.00 2011.00 2021 ▁▁▁▃▇
    start_month 387 0.98 6.44 3.39 1.00 4.00 7.00 9.00 12 ▇▅▆▆▇
    start_day 3628 0.77 15.24 8.95 1.00 7.00 15.00 23.00 31 ▇▆▆▆▅
    end_year 0 1.00 1996.83 20.14 1900.00 1989.00 2001.00 2011.00 2021 ▁▁▁▃▇
    end_month 708 0.96 6.58 3.35 1.00 4.00 7.00 9.00 12 ▇▅▆▇▇
    end_day 3556 0.78 15.78 8.87 1.00 8.00 16.00 24.00 31 ▇▇▇▇▆
    total_deaths 4713 0.71 2844.11 68620.96 1.00 6.00 20.00 63.00 3700000 ▇▁▁▁▁
    no_injured 12227 0.24 2621.70 34407.82 1.00 14.00 50.00 200.00 1800000 ▇▁▁▁▁
    no_affected 6903 0.57 882551.63 8574833.40 1.00 1245.50 10000.00 91941.00 330000000 ▇▁▁▁▁
    no_homeless 13692 0.15 73323.11 523111.36 3.00 574.00 3000.00 17500.00 15850000 ▇▁▁▁▁
    total_affected 4507 0.72 716692.93 7719586.75 1.00 650.00 5967.00 58255.00 330000000 ▇▁▁▁▁
    insured_damages_000_us 15025 0.07 798651.42 3057638.17 34.00 50000.00 172500.00 500000.00 60000000 ▇▁▁▁▁
    total_damages_000_us 10876 0.33 724783.54 4723130.53 2.00 8300.00 60000.00 317300.00 210000000 ▇▁▁▁▁
    cpi 310 0.98 63.22 26.73 3.22 45.69 68.42 84.25 100 ▃▂▅▇▇
  • 3 You may want to adjust the width of your console pane to get a better look at the output.

    Did it!

    [Your answer here]

    For numerical columns, mean values and percentiles can give you an overview of how the data is distributed. However, for columns that contain strings or characters we cannot produce those types of summary statistics. Rather, we are probably more interested in how many unique values those columns contain and what those unique entries are.

    skim() already gave us the number of unique entries for each column; we have two options we can use to determine what those values are.

    Give it a whirl

    We’ve previously learned about two functions that return the unique entries for a vector and for a dataframe. Apply both of them below to output the unique entries of the disaster subgroups directly to the console/your html report. Then pick two more columns where you think it would be important to know how many and which values are represented4 and run a function to get that information.

  • 4 These should probably be categorical values.

    Did it!

    [Your answer here]

    The first is unique(). This function makes use of the fact that each column is a vector5.

    unique(disaster$disaster_subgroup)
    [1] "Climatological"    "Geophysical"       "Meteorological"   
    [4] "Hydrological"      "Biological"        "Extra-terrestrial"

    When you run this function it returns a vector; while this might be helpful in some contexts, it would also be nice to view the result in a tabular format. Do not fret - the tidyverse has you covered.

    disaster %>%
      distinct(disaster_subgroup)
    # A tibble: 6 × 1
      disaster_subgroup
      <chr>            
    1 Climatological   
    2 Geophysical      
    3 Meteorological   
    4 Hydrological     
    5 Biological       
    6 Extra-terrestrial
  • 5 Remember, we can access each column of a dataframe as `df$columnname`.
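Beyond unique() and distinct(), a third option worth knowing is dplyr::count(), which returns each unique value together with how often it occurs; a minimal sketch on toy data (hypothetical values):

```r
library(dplyr)

# toy data with repeated categorical values (made up)
toy <- data.frame(subgroup = c("Hydrological", "Geophysical", "Hydrological"))

# each unique value plus its frequency, most common first
toy %>% count(subgroup, sort = TRUE)
```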

    16.5 Step 3: Choose a question to explore

    Frequently you will have a general question which is probably why you pulled a data set in the first place. In our example that would be something along the lines of “Are natural disasters getting worse with climate change?”.

    Usually, there will be more than one way to go after that question, and you will likely want and need to do some extended exploration of the data set to generate additional questions (hypotheses) and ultimately find the one you are most interested in.

    In this case, we could focus on whether we are indeed observing an increase in the frequency of events, and/or we could compare whether groups of natural disasters that are more dependent on climate/weather conditions (storms, hurricanes) are increasing more rapidly than categories that have other origins (e.g. earthquakes or volcanic eruptions).

    Or, we could have already done some analysis and/or background reading indicating that natural disasters linked to climate patterns are indeed predicted to get worse based on modeling. In that case we might be more interested in whether the impact of natural disasters on humans is getting increasingly worse. We could be at a point where at least some of the changes to the climate system are inevitable; in that case, questions of mitigating effects and of preparedness, like early warning systems or infrastructure, become more important. We cannot stop the events from happening, but to an extent we can control how well we deal with them.

    Exploratory analysis is fundamentally a very creative process, and thinking outside the box can be what sets your analysis apart from what everyone else is doing. However, before we can start thinking outside the box we need to first figure out what “the box” is and use that as our starting point. Unfortunately, this does mean you should always start with the boring standard stuff: cover your bases and then work from there towards something more interesting. That said, not infrequently the “boring” stuff is the most revealing!

    16.6 Step 4: Plot the data!

    While tables are sometimes more appropriate, most exploratory analysis involves generating plots that help you get an overview of the data in relation to your question.

    Consider this

    We should make a distinction between exploratory figures and explanatory or final figures. Briefly contrast the two in terms of their goal and audience and how this might be reflected in the presentation of the visualization.

    Did it!

    [Your answer here]

    16.6.1 Explore the distribution of the data

    A good place to start is to get an idea of the variation or variance of the data you are interested in.

    Consider this

    List 3 - 5 ways that you can assess the variance of a measured value. What metrics can you calculate and what options do we have to visualize?

    Did it!

    [Your answer here]

    You can calculate summary statistics - we have already encountered the base R functions that will allow you to do this.

    • mean (mean()), median (median())
    • minimum (min()), maximum (max()) values to get the range of values
    • quantiles (quantile())

    (fivenum() will give you the min, 25th percentile, median, 75th percentile, and max)
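On a made-up vector of values, these summary functions look like this:

```r
# toy vector of values (made up) with one extreme value
x <- c(2, 4, 4, 5, 10, 100)

mean(x)                      # arithmetic mean, pulled up by the extreme value
median(x)                    # middle value, robust to the extreme value
min(x); max(x)               # range of the values
quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
fivenum(x)                   # min, lower hinge, median, upper hinge, max
```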

    To visualize, you can use

    • barplots for categorical values (geom_bar(stat = "identity"))
    • histograms for continuous values (geom_histogram(); here your individual bins become the categories of a categorical value)
    • boxplots to compare distributions of continuous values (geom_boxplot())

    For any of these assessments, you should always ask the following questions:

    • Which values are more common than others? Why?
    • Which values are more rare than others? Why?
    • Are there distinct clusters that emerge? Why does/doesn’t that make sense?
    • Do overall patterns meet my expectations? What could cause deviations? Why do I have these expectations of the data?
    • Are there patterns that stand out? Are they unusual/unexpected? What could be causing them?

    In each case, as you pick up on patterns you want to also ask “why” and “how come” questions6 … those are the ones that will lead you to the next steps of your exploration.

  • 6 Imagine a cute toddler right next to you asking “why” at every step.

    Typically, you should always explore outlier values to determine whether they are true outliers or whether it is legitimate to remove them, along with patterns of missing data. Keep in mind that if there is too much missing data, that could be an indication that you should be looking for a better data set.

    Let’s get started. You will hopefully quickly see that for any type of exploratory analysis you start with one plot/question to visualize a pattern, and then typically go through a few rounds of plotting that same component of the data set in different ways to get a view from a few different angles before moving on to the next thing. Be patient and try various things, even if some of the things you pursue end up being dead ends.

    Let’s start by figuring out how many observations fall into each sub-category of natural disasters7.

  • 7 We finally get to see the alternative to stat = "identity" in using barplots! To plot the distribution of a categorical variable, you are now telling ggplot to count the number of observations for each category.

    ggplot(disaster, aes(x = disaster_subgroup)) +
      geom_bar(stat = "count")

    and we could do the same for our disaster types.

    ggplot(disaster, aes(x = disaster_type)) +
      geom_bar(stat = "count")

    We probably want to be able to read those labels so let’s flip them by 90 degrees.

    ggplot(disaster, aes(x = disaster_type)) +
      geom_bar(stat = "count") +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

    This is giving us global numbers; we might want to break it up by region.

    ggplot(disaster, aes(x = disaster_type)) +
      geom_bar(stat = "count") +
      facet_wrap(. ~ region) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

    Hm, that’s interesting.

    Next, our question is connected to understanding the “getting worse” component, so we have to consider ways we can measure “worseness”. One way to do that is impact. In this case, we would want to filter our data set to include only events for which damages were assessed.

    Give it a whirl

    Create a new object called damages that includes only entries for which observations of total damages in US$ were made (total_damages_000_us), and then get an overview of the distribution of the damage incurred by plotting a histogram.

    Did it!

    [Your answer here]

    Remember that you can subset rows using !is.na() as a condition to retain only observations that have a value (i.e. that aren’t NA).
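As a minimal sketch of that pattern on toy data (column names are made up, not from the disaster data):

```r
library(dplyr)

# toy data where one damage value is missing
toy <- data.frame(event  = c("a", "b", "c"),
                  damage = c(10, NA, 30))

# keep only rows where damage was actually recorded
toy %>% filter(!is.na(damage))
```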

    This is what your figure should look like

    #| echo: false
    
    ggplot(damages, aes(x = total_damages_000_us)) +
      geom_histogram()

    Ugh, that plot does not initially look too helpful; it seems like almost all of our values fall into the same bin. This could be because we have some outliers that are skewing our histogram.

    We can get a better idea by using e.g. a square-root scale, which transforms the y-axis in a systematic way8.

  • 8 This is the type of thing where, in order to not unintentionally mislead your audience, you would always want to add “Note the transformed y-scale”.

    ggplot(damages, aes(x = total_damages_000_us)) +
      geom_histogram() +
      scale_y_sqrt()

    Give it a whirl

    Another option would be to set the binwidths manually.

    Create a new column that contains your dollar amounts divided by 1,000,000 to make the x-axis more legible. Then create a histogram showing the damages binned by $5 Million. Name your x-axis to reflect that this is damages in millions of dollars.

    Then create a second plot with individual panels showing the damages broken down by disaster type.

    Did it!

    [Your answer here]

    Here is what your first figure should look like

    And this figure shows the variation in damages for each disaster type.

    Based on our second figure, it looks like we might want to limit our assessment to droughts, earthquakes, wildfires, floods, and storms; using boxplots instead of histograms is also going to make these comparisons easier.

    Give it a whirl

    Subset your damages dataframe to contain only observations of droughts, earthquakes, wildfires, floods, and storms; no need to create a new dataframe, just overwrite the one you currently have.

    Then create a boxplot comparing the distribution of damages in millions of dollars per disaster type. Make sure to update the x- and y-axis labels to be descriptive.

    Did it!

    [Your answer here]

    Here’s what your figure should look like

    From your figure, you’ve probably seen that the big differences seem to be in what the extreme values are in each group, but it is difficult to tell how different the mean, median, and general spread of the data are.

    Give it a whirl

    Mitigate this by creating a table (make sure it prints neatly to your console/html report) that contains the mean, median, standard deviation, minimum and maximum value of damages in Millions of USD for each disaster type.

    Did it!

    [Your answer here]

    Here is what a table with summary stats should look like.

    disaster_type median mean sd min max
    Drought 0.1100 0.8805004 2.045504 0.000050 20.0
    Earthquake 0.0388 1.4403066 10.581451 0.000002 210.0
    Flood 0.0495 0.4972023 1.992408 0.000003 40.0
    Storm 0.0760 0.7409707 4.311794 0.000005 125.0
    Wildfire 0.1020 0.7286744 2.037236 0.000600 16.5

    16.7 Step 5: Explore relationships among variables

    After you’ve explored patterns within variables you are interested in you are also going to want to explore relationships (Covariation) among variables9.

  • 9 Remember, when we start looking at correlation, we always need to be careful about how/when we infer causation!

    Consider this

    List 3 - 5 ways that you can assess the covariance of two measured values.

    Did it!

    [Your answer here]

    You can calculate summary statistics and compare them, but for initial exploration of patterns visualizations are going to be much more handy.

    You could create faceted barplots or have bars next to each other in a figure; faceted histograms can be helpful too. But especially with an increasing number of comparisons, one of the most straightforward ways to compare distributions is the boxplot.

    Comparing two categorical values can be challenging - heat plots can be used to visualize the number of observations that fall into both categories; for two continuous values you can use scatter plots (though these become less useful as the number of observations in your data set grows).
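As a sketch of the heat-plot idea (toy data, made-up values): count the observations for each combination of categories, then map the count to the fill of a tile.

```r
library(dplyr)
library(ggplot2)

# toy data: two categorical variables (made-up values)
toy <- data.frame(
  type   = c("Flood", "Flood", "Storm", "Storm",  "Storm"),
  region = c("Asia",  "Asia",  "Asia",  "Europe", "Europe")
)

# one tile per combination of categories, shaded by the number of observations
p <- toy %>%
  count(type, region) %>%
  ggplot(aes(x = type, y = region, fill = n)) +
  geom_tile()
```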

    Are damages incurred by natural disasters increasing over time?

    We have already spent a little bit of time looking at more than one variable by looking at subgroups within our data set, specifically the disaster types and regional differences.

    However, given the initial question we are asking of the data set, the relationship we are really interested in is the change over time. This means we are technically comparing two continuous values, but this specific case is one we frequently describe as a time series.

    Let’s take a look at the temporal patterns we can observe.

    Give it a whirl

    First, create a plot that shows the change in total damages incurred over time.

    Did it!

    [Your answer here]

    Here is what your plot should look like:

    That’s helpful, but from this figure it is quite difficult to tell whether this is a consistent pattern, or if it is one that is driven for example by a specific type of disaster.

    We could enhance this figure and our understanding of the data set by e.g. color coding by disaster type.

    Give it a whirl

    Re-plot the time-series of damages incurred but now color code the individual data points by disaster type.

    Did it!

    [Your answer here]

    Here is what your figure should look like.

    ggplot(damages, aes(x = year, y = total_damages_000_us/1000000, fill = disaster_type)) +
      geom_point(shape = 21, size = 2)

    While that gives us a bit of a better idea of the pattern in the data, it is still difficult to tell if there are differences between the disaster types - this also does not let us parse whether this is a consistent pattern across the globe or if it is for example, driven by specific geographic regions.

    Give it a whirl

    Take a closer look at the data by plotting the damage incurred by natural disasters. Color code individual data points by continent and create individual panels for each disaster type.

    Did it!

    [Your answer here]

    Here is what your plot could look like.

    Is the magnitude of the disaster correlated to the damages it incurs?

    Another important relationship to explore is whether the damages incurred by a given natural disaster scales with its magnitude. Both frequency and intensity of climate-related natural disasters are expected to increase so this is an important relationship to understand.

    Give it a whirl

    The magnitude of each of these types of natural disasters is measured in a very different way, so to plot the relationship between magnitude (dis_mag_value) and total damages incurred, you will need to subset the data set to hold only observations for one disaster type at a time.

    Create a plot showing this relationship for storms, floods, earthquakes, droughts, and wildfires in individual plots using patchowrk10.

    Remember that you can use #| fig-height: and #| fig-width: to make sure that the figure looks good in your html report. This does require rendering your html, checking it and then adjusting the height and width of the figures in the output as needed.

  • 10 Check out the layout guidelines for patchwork if you need a refresher on how to combine plots.

  • Did it!

    [Your answer here]

    Here is what your plot could look like. I additionally color coded the individual data points by year to see if there were any patterns that popped out.

    So far we haven’t really found any patterns that really pop out - though keep in mind that “non-patterns” are also noteworthy in an of themselves11.

  • 11 Recall that we’ve discussed the importance of creating hypothesis that are falsifiable. Even if we thing there should be certain relationships, during an exploratory analysis we always need to be open to the idea that certain relationships are not in the data and that we cannot try to manipulate the data to make it seem as if there were.

  • can we create metrics that help us better understand important relationships?

    We do know from our previous plots that we tend to have a pretty tight distribution with damages being pretty similar across events. However, you have probably also noticed that we always see that there are individual events that are outliers in terms of how much damage they incur.

    Since are looking at very broad-scale patterns of it might make sense to transform our data in order to get a new metric based on the data at hand that enables a more helpful and informative comparison.

    In this case, it could be helpful to calculate the total costs incurred per year for each disaster type and then compare how those total values are changing over time.

    Let’s start by creating a new data frame that contains information on total damages incurred per year.

    total_yr <- disaster %>%
      filter(!is.na(total_damages_000_us)) %>%
      group_by(year, disaster_type) %>%
      summarize(total_damages_yr = sum(total_damages_000_us)/1000000000)

    Let’s plot that data set.

    Give it a whirl

    Use our new data frame to create a plot that shows the total damages incurred per year for each disaster type ina an individual panel..

    Did it!

    [Your answer here]

    Here is what your figure should look like.

    Now we’re getting somewhere.

    As we are looking at these plots, it is important to keep in mind that increase in total damages per year can be due to a combination of more individual events but all have small number of damages incurred or if we have more incidence of high cost events and of course, it could be a combination thereof.

    Let’s create a data set with a limited disaster types so we can elucidate patterns more easily, and while we’re at it also calculate frequency of each disaster type per year.

    Give it a whirl

    Subset the total_yr data set so that it only contains observations for droughts, earthquakes, extreme temperatures, floods, storms, and wildfires. Then create a data frame that contains total damages per year and disaster type as well as the tota number of events that occured that year.

    Then create a plot that contains the relationship of total damages incurred for each year with an indiviual panel for each disaster type.

    Did it!

    [Your answer here]

    Here is what your data set should look like:

    # A tibble: 6 × 4
    # Groups:   year [5]
       year disaster_type total_damages_yr events_yr
      <dbl> <chr>                    <dbl>     <int>
    1  1900 Storm                0.00003           1
    2  1902 Earthquake           0.000025          1
    3  1903 Flood                0.00048           1
    4  1905 Earthquake           0.0000488         2
    5  1906 Earthquake           0.000624          2
    6  1906 Storm                0.00002           1

    And then based off of that you can create the following plot.

    Now that we’ve discovered a better metric to be looking at, let’s quickly check what the relationship is of frequency and total damages.

    Give it a whirl

    Use your new data frame to plot the relationship between the number of events per year and the total damages incurred per year. Add a regression to help you spot relationshps.

    Did it!

    [Your answer here]

    Here is what this figure should look like.

    ggplot(total_yr, aes(x = events_yr, y = total_damages_yr)) +
      geom_point() +
      geom_smooth()

    Based off of those last two figure we’ve made, the number of events definitely appears to play a role in the total damages due to various disaster types per year, though we would need a more formal analysis to disentangle those effects.

    Pick the visualizations that best communicate the pattern you want to highlight

    Our multi-panel time-series data seems to have the data transformed in a way that will give us an answer to our question - but it’s not really a great visualization if we wanted to share results from our exploratory analysis. Generally, you’ll hit a point where you know have the data transformed in a way that is useful and you could start playing around with optimizing the presentation of that data set.

    At this point we are shifting from our exploratory figures to an explanatory figure that we might include in a report or an update to our collaborators.

    One of the problems with our multiplot visualization is that the scales for all of the y-axis are the same because we have some outlier that compress the scale. Maybe a traditional scatterplot is not ideal. Because we are interested in changes in magnitude we could try a less traditional plot, like a bubble chart. Instead of plotting the total damages incurred on the y-axis, we can plot our disaster types and then adjust the size of each bubble according to damages incurred.

    Give it a whirl

    Create a bubble chart using the following code. Add comments to describe what each function/parameter is doing. Then describe the figure. Be specific in terms of how the data is encoded and what the general pattern and notable results are.

    ggplot(total_yr, aes(x = year,                            #
                         y = disaster_type,                   #
                         size = total_damages_yr/1000000)) +  #
      geom_point()                                            #

    Did it!

    [Your answer here]

    Now we’re cooking!

    Let’s think of some ways that we can refine this plot. We have overlapping points, we also might want additional bin sizes.

    Give it a whirl

    Run the following code and then comment it line by line to describe what each function/argument is doing.

    ggplot(total_yr, aes(x = year,
                         y = disaster_type,
                         size = total_damages_yr/1000000,
                         fill = disaster_type)) +
      geom_point(shape = 21, alpha = .5) +
      scale_size_continuous(breaks = c(10, 25, 50, 100, 200),
                            name = "damages [USD]") +
      theme(legend.position = "bottom")

    Let’s see if we can fix our legend to something more visible, we might want different colors, and we’ll need to add axis labels, plot titles etc.

    Give it a whirl

    Run the following code and then comment line by line to describe what each function/argument is doing.

    ggplot(total_yr, aes(x = year,
                         y = disaster_type,
                         size = total_damages_yr/1000000,
                         fill = disaster_type)) +
      geom_point(shape = 21,
                 alpha = .5) +
      scale_fill_viridis_d() +
      scale_size_continuous(breaks = c(5, 10, 25, 50, 100, 200),
                            name = "damages [USD]") +
      labs(x = "total damages incurred [USD]", y = "disaster type",
           title = "Economic costs incurred by natural disasters (1900 - 2020)",
           subtitle = "The size of the bubble represents the total damages incurred by all events in a given category.",
           caption = "data: EMDAT/International Disaster Database") +
      theme_standard +
      theme(legend.position = "bottom") +
      guides(fill = FALSE,
             size = guide_legend(override.aes = list(fill = "black", alpha = 1)))

    can we infer/predict relationships between our sample and the whole population?

    A final step in our exploratory analysis is starting to ask whether patterns are specific to the data set (descriptive analysis) or whether there are true non-random relationships between variables (inferential analysis). The overview of the patterns in the data that we get through the exploratory analysis are really important to help us determine what questions/hypothesis are worth exploring more in depth.

    Typical questions we might further explore at this point include e.g.

    • Is this pattern due to random change or does the independent variable have significant explanatory power?
    • How can the relationship implied by the pattern be described (e.g. linear regression)?
    • How strong is the relationship (R2 value)?
    • What other variables might affect or explain the relationship (e.g. are these variable correlated because they are both correlated to another variable that is responsible for the mechanism?)
    • Does this relationship hold across subgroups of the relationship itself?

    Fitting trend lines helps elucidate patterns (especially in time series). This is generally the point where exploratory analysis transitions into a more formal analysis that involves statistical tests and fitting models. Keep in mind that we want to have a good grasp on whether our analysis is descriptive, inferential/predictive or causal/mechanistic in nature. Exploratory analysis is “quick and dirty”, you are mainly focused on summarizing the data and highlighting broad-scale patterns that emerge. It is useful for assessing the quality of our data, eye-balling if patterns are what you expect (was your hypothesis correct(ish)), and getting and idea of what the next steps in a more formal analysis should be.

    16.8 Step 6: Iterate your exploration

    Now that we have an answer(ish) to our question - the first thing we should do is question our answer12. Always consider limitations of your data, whether there are alternative explanations, if your data has some inherent issues you need to be aware of etc. It can also be helpful to find at least one external data source to assess whether your answers are roughly in line with other (expected) measurements.

  • 12 especially if your answer conforms to your a priori expectations!

  • It is called an exploratory analysis for a reason, because it is just the beginning. Usually this is the beginning of a more sophisticated analysis. Important questions to ask always include

    • Is your data appropriate for the question your are asking?
    • Do you need additional data to refine your answer?
    • Do you have the right question or do you need to refine your question?

    Exploratory Analysis frequently is more about hypothesis generating, rather than hypothesis testing. So you data exploration can be an important step to defining the precise (set of) question(s) you want to explore. Not infrequently, the exploration of an initial question leads you to generating additional (perhaps more interesting) questions.

    So for example, one of the first questions we might ask is whether the pattern we see with higher damages being incurred holds if we compare this to loss of life.

    Give it a whirl

    Transform your data set similar to the way we did it above to create a data set that contains the number of deaths observed for each disaster type in each year. Then use that new data frame to create bubble chart showing the change in the number of deaths per year for each disaster type.

    Did it!

    [Your answer here]

    Here is what the first few lines of your data set will look like:

    deaths <- disaster %>%
      filter(!is.na(total_deaths)) %>%
      group_by(year, disaster_type) %>%
      summarize(total_deaths_yr = sum(total_deaths))
    
    head(deaths)

    And here is what the plot should look like.

    ggplot(deaths, aes(x = year, y = disaster_type, size = total_deaths_yr/100000, fill = disaster_type)) +
      geom_point(shape = 21, alpha = .5) +
      scale_fill_viridis_d() +
      scale_size_continuous(breaks = c(.5, 1, 5, 10, 15, 30),
                            name = "100k deaths") +
      labs(x = "total deaths incurred", y = "disaster type",
           title = "Total deaths due to natural disasters (1900 - 2020)",
           subtitle = "The size of the bubble represents the total loss of life incurred by all events per year.",
           caption = "data: EMDAT/International Disaster Database") +
      theme_standard +
      theme(legend.position = "bottom") +
      guides(fill = FALSE,
             size = guide_legend(override.aes = list(fill = "black", alpha = 1)))
    Consider this

    Compare and contrast the total damages incurred per year with the total number of deaths due to each disaster type13.

    Then use what you have learned from the assigned readings and other media to discuss what could be causing the differences/similarities in these two figures.

    Conclude by making a statement about what we have learned in terms of whether natural disasters are indeed getting worse.

  • 13 A good way to do this is to briefly describe each figure to summarize the broad pattern(s), then point out what they have in common and what sets them apart

  • Did it!

    [Your answer here]

    Additionally we could break down regional patterns, or even pull in additional data of GDP of individual countries to compare costs incurred scaled to the economies of different countries.

    It’s always helpful to take notes as you go to jot down questions and also ideas on what might explain certain patterns you see.